Take-home_Ex1: Geospatial Analytics for Public Good
Setting the Scene
The increasing digitization of urban infrastructures, such as public transportation and utilities, generates vast datasets using technologies like GPS and RFID. These datasets offer valuable insights into human movement patterns within a city, especially with the widespread deployment of smart cards and GPS devices in vehicles. However, current practices often limit the use of this data to basic tracking and mapping with GIS applications, primarily because conventional GIS lacks advanced functions for analyzing and modeling spatial and spatio-temporal data effectively. Enhanced analysis of these datasets could significantly contribute to better urban management and informed decision-making for both public and private urban transport services providers.
Objectives of Take-home_Ex1
Exploratory Spatial Data Analysis (ESDA) hold tremendous potential to address complex problems facing society. In this study, we are tasked to apply appropriate Local Indicators of Spatial Association (GLISA) and Emerging Hot Spot Analysis (EHSA) to undercover the spatial and spatio-temporal mobility patterns of public bus passengers in Singapore.
The Data
Aspatial data
For this take-home exercise, Passenger Volume by Origin Destination Bus Stops downloaded from LTA DataMall will be used. It contains the number of trips by weekdays and weekends from origin to destination bus stops.
For this exercise, the data is collected in August 2023. The fields are YEAR_MONTH, DAY_TYPE, TIME_PER_HOUR, PT_TYPE , ORIGIN_PT_CODE, DESTINATION_PT_CODE, TOTAL_TRIPS . A sample row for bus dataset could be ‘2023-08, WEEKDAY, 16, BUS, 28299, 28009, 63’. TIME_PER_HOUR of 16 represents data is collected between 4 pm to 5pm.
Geospatial data
Two geospatial data will be used in this study, they are:
Bus Stop Location from LTA DataMall’s static dataset. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
hexagon, a hexagon layer of 250m (this distance is the perpendicular distance between the centre of the hexagon and its edges.) should be used to replace the relative coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA.
The Task
The specific tasks of this take-home exercise are as follows:
Geovisualisation and Analysis
With reference to the time intervals provided in the table below, compute the passenger trips generated by origin at the hexagon level,
Peak hour period Bus tap on time Weekday morning peak 6am to 9am Weekday afternoon peak 5pm to 8pm Weekend/holiday morning peak 11am to 2pm Weekend/holiday evening peak 4pm to 7pm Display the geographical distribution of the passenger trips by using appropriate geovisualisation methods,
Describe the spatial patterns revealed by the geovisualisation (not more than 200 words per visual).
Local Indicators of Spatial Association (LISA) Analysis
Compute LISA of the passengers trips generate by origin at hexagon level.
Display the LISA maps of the passengers trips generate by origin at hexagon level. The maps should only display the significant (i.e. p-value < 0.05)
With reference to the analysis results, draw statistical conclusions (not more than 200 words per visual).
Emerging Hot Spot Analysis(EHSA)
With reference to the passenger trips by origin at the hexagon level for the four time intervals given above:
Perform Mann-Kendall Test by using the spatio-temporal local Gi* values,
Prepared EHSA maps of the Gi* values of the passenger trips by origin at the hexagon level. The maps should only display the significant (i.e. p-value < 0.05).
With reference to the EHSA maps and data visualisation prepared, describe the spatial patterns reveled. (not more than 250 words per cluster).
Getting Started
In this exercise, the following libraries will be used:
sfpackageto perform geoprocessing taskssfdep package which builds upon
spdeppackage (compute spatial weights matrix and spatially lagged variable for instance..)tmap to create geovisualisations
tidyversethat supports data science, analysis and communication task including creating static statistical graphs.knitrto create html tablesDT library to create interactive html tables
ggplot2 to plot graphs
plotly to plot interactive graphs
Importing Data
Aspatial data
Import the Passenger volume by Origin Destination Bus Stops dataset downloaded from the LTA Datamall by using the read_csv() of the readr package.
Check the datafields
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
Processing the aspatial OD data
The ‘ORIGIN_PT_CODE’ and ‘DESTINATION_PT_CODE’ field is in character field. We will convert it to factor data type.
The function below will extract origin data based on the four time intervals required by the task. The expected arguments are
- daytype: ‘WEEKDAY’ or ‘WEEKENDS/HOLIDAY’
- timeinterval: c(8,10) if we want data from 8am to 11am.
The function will also compute the sum of all trips by ‘ORIGIN_PT_CODE’ for each time interval and stored under a new field called ‘TRIPS’.
Let’s get the data using get_origin function
Take a look at overview of all the four dataframes.
Geospatial data
First, we will import the Bus Stop Location shapefiles into R. The output will be a sf point object with 5161 points and 3 fields. As the raw data is in WSG84 geographical coordinate system, we will convert it to EPSG 3414, the projected coordinate system for Singapore.
busstop <- st_read(dsn="data/geospatial/BusStopLocation/BusStopLocation_Jul2023", layer = "BusStop") %>%
st_transform(crs = 3414)Reading layer `BusStop' from data source
`C:\yixin-neo\ISSS624_AGA\Take-home_Ex1\data\geospatial\BusStopLocation\BusStopLocation_Jul2023'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
First 10 features:
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
1 22069 B06 OPP CEVA LOGISTICS POINT (13576.31 32883.65)
2 32071 B23 AFT TRACK 13 POINT (13228.59 44206.38)
3 44331 B01 BLK 239 POINT (21045.1 40242.08)
4 96081 B05 GRACE INDEPENDENT CH POINT (41603.76 35413.11)
5 11561 B05 BLK 166 POINT (24568.74 30391.85)
6 66191 B03 AFT CORFE PL POINT (30951.58 38079.61)
7 23389 B02A PEC LTD POINT (12476.9 32211.6)
8 54411 B02 BLK 527 POINT (30329.45 39373.92)
9 28531 B09 BLK 536 POINT (14993.31 36905.61)
10 96139 B01 BLK 148 POINT (41642.81 36513.9)
Checking for duplicates in the ‘BUS_STOP_N’ field reveals that there are about 16 repeated bus stop numbers. However, they have different geometry points in the simple feature busstop object above. These could be due to temprorary bus stops . We should retain all these rows as they might have different hexagon ‘fid’ values later.
busstop %>%
st_drop_geometry() %>%
group_by(BUS_STOP_N) %>%
filter(n()>1) %>%
ungroup() %>%
arrange(BUS_STOP_N)# A tibble: 32 × 3
BUS_STOP_N BUS_ROOF_N LOC_DESC
<chr> <chr> <chr>
1 11009 B04 Ghim Moh Ter
2 11009 B04-TMNL GHIM MOH TER
3 22501 B02 Blk 662A
4 22501 B02 BLK 662A
5 43709 B06 BLK 644
6 43709 B06 BLK 644
7 47201 UNK <NA>
8 47201 NIL W'LANDS NTH STN
9 51071 B21 MACRITCHIE RESERVOIR
10 51071 B21 MACRITCHIE RESERVOIR
# ℹ 22 more rows
Take for instance bus stop number 51071 with two different point geometry values.
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 28311.27 ymin: 36036.92 xmax: 28311.27 ymax: 36036.92
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
3470 51071 B21 MACRITCHIE RESERVOIR POINT (28311.27 36036.92)
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 28282.54 ymin: 36033.93 xmax: 28282.54 ymax: 36033.93
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
3472 51071 B21 MACRITCHIE RESERVOIR POINT (28282.54 36033.93)
Creating hexagon layer
Before we can plot the base layer, we have to create hexagonal grids of 250 m using the busstop points data using sf.
First , create a grid which the extent equals to the bounding box of the busstop points using st_make_grid().
- To create hexagons of 250m (centre to edge), we should input 500 for ‘cellsize’ parameter. ‘Cellsize’ is defined as the distance from edge to edge.
- Convert to
sfobject and add a unique identifier to each hexagon grid. The output has 5580 hexagon units. - Use st_intersects() to count the number of bus stops in each hexagon.
- Filter to retain only hexagons with at least one bus stop. The output
bs_counthas 1524 hexagon units.
area_hex_grid = st_make_grid(busstop,
cellsize= 500,
what = "polygons",
square = FALSE)
# To sf and add grid ID
hex_grid_sf = st_sf(area_hex_grid) %>%
# add grid ID
mutate(grid_id = 1:length(lengths(area_hex_grid)))
# count number of points in each grid
hex_grid_sf$bs = lengths(st_intersects(hex_grid_sf, busstop))
# remove grid without value of 0 (i.e. no points in side that grid)
bs_count = filter(hex_grid_sf, bs > 0)
bs_countSimple feature collection with 1524 features and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 26193.43 xmax: 48720.12 ymax: 53184.55
Projected CRS: SVY21 / Singapore TM
First 10 features:
area_hex_grid grid_id bs
1 POLYGON ((3970.122 27925.48... 34 1
2 POLYGON ((4220.122 28358.49... 65 1
3 POLYGON ((4470.122 30523.55... 99 1
4 POLYGON ((4720.122 28358.49... 127 1
5 POLYGON ((4720.122 30090.54... 129 2
6 POLYGON ((4720.122 30956.57... 130 1
7 POLYGON ((4720.122 31822.59... 131 1
8 POLYGON ((4970.122 28791.5,... 159 1
9 POLYGON ((4970.122 29657.53... 160 1
10 POLYGON ((4970.122 30523.55... 161 2
Visualisation of our hexagon base map.
We could save this hexagon layer to disk
Geospatial data wrangling
Combining busstop (polygon sf) and bs_count (point sf)
We need to get the corresponding bus stop number of each hexagon in bs_count hexagon layer.
The code chunk below performs points and hexagon polygon overlap and the output will be in sf point object.
Before overlapping:
busstop : 5161 points
hex : 1524 hexagons
After overlapping:
busstop_hex : 5161 points and this is good results because it shows that we are able to retain all the the bus stop data with our bs_count base map.
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
First 10 features:
BUS_STOP_N grid_id bs geometry
3269 25059 34 1 POINT (3970.122 28063.28)
2570 25751 65 1 POINT (4427.939 28758.67)
254 26379 99 1 POINT (4473.292 30681.57)
2897 25761 127 1 POINT (4737.082 28621.29)
2827 25719 129 2 POINT (4799.476 30158.46)
4203 26389 129 2 POINT (4776.694 30324.88)
2403 26369 130 1 POINT (4604.363 31212.96)
1565 26299 131 1 POINT (4879.489 31948.93)
2829 25741 159 1 POINT (5060.733 29212.4)
2828 25711 160 1 POINT (4831.438 30132.43)
We will drop the geometry because busstop_hex is a POINT sf object, there is no hex polygon geometry data for us to plot based on hexagon level. Furthermore, we have to process the attribute data. To get back the hexagon POLYGON geometry data, we can always left_join() bs_count df with our attribute table again later.
We will also use datatable() function of the DT library to print the data. The table is interactive and can perform basic sorting.
busstop_hex <- busstop_hex %>%
st_drop_geometry()
datatable(busstop_hex, class = 'cell-border stripe', options = list(pageLength = 5))Let us check for duplicates in busstop_hex df as it will be used to perform a left join later. The output shows that there are 11 duplicated rows.
# A tibble: 22 × 3
BUS_STOP_N grid_id bs
<chr> <int> <int>
1 22501 1251 8
2 22501 1251 8
3 43709 1904 7
4 43709 1904 7
5 47201 2381 2
6 47201 2381 2
7 11009 2395 7
8 11009 2395 7
9 58031 2939 7
10 58031 2939 7
# ℹ 12 more rows
Removal of duplicates
Check again to be sure. All’s good.
Combining each of the four aspatial dataframes with busstop_hex
The function below performs left outer join for each of the four aspatial origin dataframes with busstop_hex geospatial dataframe. The expected argument is
asp_df: name of aspatial origin dataframe like origin_day_am or origin_day_pm etc..
The function will also rename ‘ORIGIN_PT_CODE’ and ‘fid’ fields to ‘ORIGIN_BS’ and ‘ORIGIN_HEXFID’ respectively.
Use the ‘leftjoin’ function to get our dataframes containing grid_id and origin bus stop ids.
The number of rows increased by 5 after left join , this is due to 5 bus stops, each with two distinct different point geometry coordinates, that originated from the busstop and busstop_hex dataframes earlier.
Again, double check for duplicates
# A tibble: 0 × 3
# ℹ 3 variables: ORIGIN_HEXFID <int>, ORIGIN_BS <chr>, TRIPS <dbl>
# A tibble: 0 × 3
# ℹ 3 variables: ORIGIN_HEXFID <int>, ORIGIN_BS <chr>, TRIPS <dbl>
# A tibble: 0 × 3
# ℹ 3 variables: ORIGIN_HEXFID <int>, ORIGIN_BS <chr>, TRIPS <dbl>
# A tibble: 0 × 3
# ℹ 3 variables: ORIGIN_HEXFID <int>, ORIGIN_BS <chr>, TRIPS <dbl>
FYI: Which are the 5 bus stop numbers with two distinct point geometry coordinates?
[1] "53041" "52059" "68091" "68099" "67421"
Since we are plotting the number of passenger trips generated by hexagon level, we should aggregate the total trips by ‘ORIGIN_HEXFID’ and store these values in a new field called ‘TTRIPS’.
After the left join, the total distinct of hexagons for each time intervals are 1491, 1489, 1499 and 1444. The total hexagon in bs_count was 1524. This suggests that were some bus stops with no passengers at different time intervals.
get_ttrips <- function(df) {
result<- df %>%
group_by(ORIGIN_HEXFID) %>%
summarise(TTRIPS = sum(TRIPS)) %>%
ungroup()
return(result)
}
origin_data_day_am_hex <- get_ttrips(origin_data_day_am) # 5010 to 1491 rows
origin_data_day_pm_hex <- get_ttrips(origin_data_day_pm) # 4970 to 1489 rows
origin_data_end_am_hex <- get_ttrips(origin_data_end_am) # 4999 to 1499 rows
origin_data_end_pm_hex <- get_ttrips(origin_data_end_pm) # 4627 to 1444 rowsPrint out the above four dataframes
Retrieve hexagon geometry coordinates
In order to plot choropleth maps, we need the geometry data from bs_countsf polygon object.
The function below performs a left join with bs_count and the attributes dataframes and also assign 0 TTRIPS value to a hexagon without any passengers.
get_hexgeo <- function(df) {
result <- left_join(bs_count, df,
by = c('grid_id' = 'ORIGIN_HEXFID')) %>%
mutate(TTRIPS = if_else(is.na(TTRIPS),0, TTRIPS))
return(result)
}
day_am <- get_hexgeo(origin_data_day_am_hex)
day_pm <- get_hexgeo(origin_data_day_pm_hex)
end_am <- get_hexgeo(origin_data_end_am_hex)
end_pm <- get_hexgeo(origin_data_end_pm_hex)Choropleth maps
In this section , the choropleth maps will show the geographical distribution of the passenger trips by origin. However, we need to plan for the classification for the maps. We could check the distribution of the ‘TTRIPS’ field across ALL four time intervals.
FIrst, combine all the ‘TTRIPS’ and create a new column ‘source’ to retain the name of the time interval.
Plot boxplots to get a general sensing of the distribution of ‘TTRIPS’ across different time intervals.
ggplot(data=combine,
aes(y=TTRIPS,
x=source)) +
geom_boxplot() +
geom_point(stat="summary",
fun.y="mean",
colour ="red",
size=2) +
labs(x = "Time Intervals", y = "Total Trips", title='Total origin trips by time intervals') +
theme_minimal() +
theme(legend.key.size = unit(0.5,'cm'),
legend.position="bottom",
plot.title = element_text(size = 12,
face='bold'),
axis.title = element_text(size = 11 , face = "bold"),
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 0, hjust = 1))
From the chart above, both the weekday ridership and the number of bus stops utilised severely outweighs the weekend period.
Final check on the summary statistics of TTRIPS before setting custom break points.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 235.5 1519.0 7355.1 7587.0 337056.0
Before of the large number of outliers bus stops, let us calculate the upper bound of the outliers.
Q1 <- summary(combine$TTRIPS)[2]
Q3 <- summary(combine$TTRIPS)[5]
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
lower_bound 1st Qu.
-10791.75
3rd Qu.
18614.25
With reference to the box plor, summary statistics and upper bound values,
Min : 1 , Max : 447,968
break points can be 230, 1520, 7350, 18600
break vector is thus c(0, 250 1500, 7600, 18600, 337056)
The last interval would set the outliers (bus stops with very high passengers) apart.
plotmap <- function(df, title) {
result <- tm_shape(df)+
tm_fill("TTRIPS",
breaks = c(0, 250, 1500, 7600, 18650,447968),
#style = "quantile",
palette = "Blues",
alpha= 0.7,
#legend.hist = TRUE,
#legend.is.portrait = TRUE,
#legend.hist.z = 0.3,
title = "Passengers Trip") +
tm_layout(main.title = title,
main.title.position = "center",
main.title.size = 2,
legend.height = 0.65,
legend.width = 0.55,
#legend.outside = TRUE,
#legend.text.size= 0.6,
#inner.margins = c(0.01, 0.01, 0, .15),
#legend.position = c("right", "top"),
#bg.color = "black",
#main.title.color = 'white',
#legend.title.color = 'white',
#legend.text.color= 'white',
bg.color = "mintcream",
frame = TRUE) +
tm_borders(alpha = 0.5) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: Passenger Bus origin and destination data and Bus Stop Location data from LTA datamall",
position = c("left", "bottom"))
return(result)
}Map of Weekday-Morning-Peak Passenger Trips by Origin
Map of Weekday-Afternoon-Peak Passenger Trips by Origin
Map of Weekend-Morning-Peak Passenger Trips by Origin
Map of Weekend-Afternoon-Peak Passenger Trips by Origin
Describe the spatial patterns revealed by the geovisualisation
